This report explores a dataset with information about 4,898 white wines provided by Cortez et al. (2009). Each observation is described by its chemical properties and an expert quality review.
First, some numbers about the dataset.
## [1] 4898 13
The dataset contains 4,898 observations with 13 attributes, including one for the row index (X) and another for the quality grade.
Below are the attribute names and the summary statistics for these 13 attributes.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
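The listings above have the shape of output from `dim()`, `names()` and `summary()`. As a minimal sketch of how such listings can be produced, the block below writes a tiny made-up CSV to a temporary file and inspects it. The values, the file and the name `df_demo` are assumptions for illustration only, not the real wine data.

```r
# Made-up three-row stand-in for the wine file (illustrative values only)
csv_text <- "X,fixed.acidity,alcohol,quality
1,6.5,9.0,5
2,7.2,10.5,6
3,6.9,12.0,7"
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

df_demo <- read.csv(tmp)   # df_demo is a hypothetical name
dims <- dim(df_demo)       # number of rows, number of columns
cols <- names(df_demo)     # attribute names

print(dims)
print(cols)
print(summary(df_demo))    # min, quartiles, mean and max per column

unlink(tmp)                # remove the temporary file
```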
Each wine is graded on a scale from 0 (very bad) to 10 (excellent) by at least three wine experts.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The quality histogram shows an approximately normal distribution. Most wines are rated around 6. The maximum quality observed is 9 and the minimum is 3. No wine received a perfect 10, and only a few received a 9. Throughout this EDA report, let’s try to understand which factors are associated with better wines, according to the experts’ opinions.
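One way to inspect a grade distribution like this is to count wines per grade over the full 0 to 10 scale, so that unused grades show up as zeros. The sketch below uses synthetic grades (an assumption, not the real quality column), and `quality_demo` is a hypothetical name.

```r
set.seed(42)

# Synthetic integer grades standing in for the quality column (not real data)
quality_demo <- round(rnorm(1000, mean = 5.9, sd = 0.9))
quality_demo <- pmin(pmax(quality_demo, 0), 10)   # clip to the 0..10 scale

# Count per grade, keeping grades that never occur as explicit zeros
counts <- table(factor(quality_demo, levels = 0:10))
print(counts)

# Most frequent grade
modal_grade <- as.integer(names(counts)[which.max(counts)])
print(modal_grade)

# Draw the bar chart only in an interactive session
if (interactive()) {
  barplot(counts, xlab = "quality", ylab = "number of wines")
}
```

Using a factor with fixed levels keeps the x-axis aligned with the grading scale even when the extreme grades are empty.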
Now let’s visualize the acidity-related chemical properties of the wines in our dataset.
The acidity attributes look approximately normally distributed, but each has a long right tail of outliers that leaves the maximum far from the median. Since most wines have attribute values near the median, could the low-frequency bins correspond to higher-quality wines? The multivariate analysis may reveal some relationship between these attributes and wine quality.
Continuing the univariate exploration, let’s plot the histograms for the sulfur dioxide and sulphates attributes.
Again, as with the acidity attributes, most wines have similar sulfur dioxide and sulphates values, clustered around the median. Since the quality distribution is also approximately normal, we might expect wines with atypical attribute values to be better reviewed than those near the median; we will test this in the multivariate analysis.
Let’s plot the pH and alcohol attributes, which we might expect to be related.
From the plots we can see that the pH distribution is approximately normal, while the alcohol distribution is right-skewed: most wines sit at lower alcohol levels, with a tail toward higher ones. It is interesting that the two plots do not share the same shape, since I expected alcohol content to be closely related to pH. The different distribution shapes hint otherwise.
Finally, let’s explore the distributions of the remaining attributes: residual sugar, chlorides and density.
All of these attributes are approximately normally distributed, with the exception of residual sugar, which is highly right-skewed (most wines have low values, with a long tail toward high ones). To see how residual sugar behaves in other regions, let’s redraw the plot with a log scale on the x-axis.
With a log10 x scale, a bimodal distribution becomes visible: most wines have residual sugar either between 1 and 2 or between 7 and 15, showing that some wines contain markedly more sugar than others.
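To illustrate why a log10 scale can expose two groups that a linear axis hides, here is a small sketch on a synthetic mixture of a low-sugar and a high-sugar group. The mixture parameters and the name `sugar_demo` are made up for the example and are not fitted to the wine data.

```r
set.seed(1)

# Synthetic mixture: a drier group near 1.5 and a sweeter group near 10
# (assumed values, not the real residual sugar measurements)
sugar_demo <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.25),
                rlnorm(400, meanlog = log(10),  sdlog = 0.30))

# Bin on the raw scale and on the log10 scale without drawing
h_raw <- hist(sugar_demo, breaks = 30, plot = FALSE)
h_log <- hist(log10(sugar_demo), breaks = 30, plot = FALSE)

# Count the observations in the valley between the groups and near the first mode
log_sugar <- log10(sugar_demo)
n_valley <- sum(log_sugar > 0.5 & log_sugar < 0.6)
n_mode   <- sum(log_sugar > 0.1 & log_sugar < 0.2)
print(c(valley = n_valley, mode = n_mode))

# Draw the log-scale histogram only in an interactive session
if (interactive()) {
  plot(h_log, main = "", xlab = "log10(residual sugar)")
}
```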
Before analysing the bivariate plots, let’s compute the correlation matrix using the Pearson method to get initial correlation coefficients between attributes and to focus the multivariate analysis on the most relevant ones.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## pH sulphates alcohol quality
## X -0.1157741316 0.009807759 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 -0.174737218
## density -0.0935914935 0.074493149 -0.78013762 -0.307123313
## pH 1.0000000000 0.155951497 0.12143210 0.099427246
## sulphates 0.1559514973 1.000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.017432772 1.00000000 0.435574715
## quality 0.0994272457 0.053677877 0.43557472 1.000000000
The matrix shows strong correlations between some attributes. For instance, density and alcohol have a linear correlation of -0.78, and density and residual sugar one of 0.84. Quality and alcohol have a more moderate correlation of 0.44.
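A Pearson correlation matrix of this kind can be computed with `cor()`, and the attribute pairs can then be ranked by absolute correlation to decide where to look first. The sketch below runs on synthetic columns whose coefficients are invented so that density rises with sugar and falls with alcohol; all `_demo` names are hypothetical.

```r
set.seed(7)
n <- 500

# Synthetic columns; the coefficients are made up for illustration only
alcohol_demo <- rnorm(n, mean = 10.5, sd = 1.2)
sugar_demo   <- rexp(n, rate = 1 / 6)
density_demo <- 0.994 + 0.0004 * sugar_demo - 0.001 * alcohol_demo +
  rnorm(n, mean = 0, sd = 0.0003)

df_demo <- data.frame(alcohol = alcohol_demo,
                      residual.sugar = sugar_demo,
                      density = density_demo)

# Pearson correlation matrix
cor_mat <- cor(df_demo, method = "pearson")
print(round(cor_mat, 2))

# Rank each distinct pair by absolute correlation
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
ranked <- data.frame(a = rownames(cor_mat)[idx[, "row"]],
                     b = colnames(cor_mat)[idx[, "col"]],
                     r = cor_mat[idx])
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```

Ranking by absolute value treats strong negative and strong positive correlations alike, which matches how the report reads the matrix.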
To help with the multivariate exploration, let’s bucket the residual sugar and sulphates attributes so that we can color other scatter plots using these buckets as a reference.
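# Treat quality as a categorical variable for grouping and coloring
df_ww$quality <- factor(df_ww$quality)
# Bucket sulphates into 0.2-wide intervals: (0.2,0.4] up to (0.8,1]
df_ww$sulphates.bucket <- cut(df_ww$sulphates, breaks = seq(0.2, 1.1, 0.2))
# Bucket residual sugar into 3-unit intervals: (0,3] up to (9,12]; values above 12 become NA
df_ww$residual.sugar.bucket <- cut(df_ww$residual.sugar, breaks = seq(0, 14, 3))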
str(df_ww$sulphates.bucket)
## Factor w/ 4 levels "(0.2,0.4]","(0.4,0.6]",..: 2 2 2 1 1 2 2 2 2 2 ...
str(df_ww$residual.sugar.bucket)
## Factor w/ 4 levels "(0,3]","(3,6]",..: NA 1 3 3 3 3 3 NA 1 1 ...
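The `NA` entries in the bucket listing come from how `cut()` treats values outside its breaks: `seq(0, 14, 3)` stops at 12, so anything above 12 gets no bucket. The sketch below shows this on a few made-up values and, as one possible alternative rather than the report's method, adds an open-ended top bucket.

```r
# Illustrative residual sugar values (made up, not taken from the dataset)
sugar_demo <- c(18.5, 1.2, 7.4, 12.0, 0.6)

breaks <- seq(0, 14, 3)            # yields 0, 3, 6, 9, 12
buckets <- cut(sugar_demo, breaks = breaks)
print(buckets)

# Values above the last break (12) are not assigned to any bucket
n_missing <- sum(is.na(buckets))
print(n_missing)

# One possible alternative: add an open-ended top bucket so nothing is dropped
buckets_all <- cut(sugar_demo, breaks = c(seq(0, 12, 3), Inf))
print(buckets_all)
```

Whether to keep the `NA` values or add a catch-all bucket depends on whether very sweet wines should appear in the colored scatter plots.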
Density is expected to have a strong linear relationship with residual sugar and alcohol, given that both change the density of the wine relative to water. Let’s draw scatter plots to visualize these relationships.
As expected, density presents a strong linear correlation with both residual sugar and alcohol. Because dissolved sugar raises the density of a liquid and alcohol (less dense than water) lowers it, it is plausible that these relationships are causal, although correlation alone does not prove it.
The matrix also shows a notable correlation (-0.43) between pH and fixed acidity. Again, chemistry supports this relationship, since acidity influences pH. Let’s visualize it.
The correlation is confirmed, and we notice that most wines have a pH between 3.0 and 3.3. However, the relationship is not as strong as expected. Perhaps attributes other than fixed acidity are also influencing pH.
First, let’s see through a scatter plot how sulphates affect the pH vs. fixed acidity distribution.
It is hard to detect any pattern in how sulphates alter a wine’s pH. The previous plot does not reveal any trend.
Proceeding with the pH vs fixed acidity exploration, let’s now color our scatter plot with the residual sugar attribute as reference.
In the previous plot, we see a tendency for wines with more residual sugar to have lower pH (more acidic). However, this tendency is weak, and there is no clear pattern in how residual sugar influences pH. This suggests that, in addition to fixed acidity, residual sugar may affect a wine’s pH despite not having a strong linear correlation with it.
Now let’s explore how wine quality relates to some attributes. First, according to the correlation matrix, alcohol has the strongest linear correlation with quality (0.44). Let’s visualize this relationship to confirm it.
Indeed, wines with more alcohol tend to be better reviewed by the experts.
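A simple numeric companion to a quality vs. alcohol plot is the median alcohol level per grade. The sketch below uses synthetic data in which the grade is made to rise with alcohol plus noise; that rule is an assumption for illustration, not a model of the real wines, and the `_demo` names are hypothetical.

```r
set.seed(3)
n <- 1000

# Synthetic alcohol levels and grades (made-up rule, not the real data)
alcohol_demo <- runif(n, min = 8, max = 14)
raw_grade <- 3 + 0.5 * (alcohol_demo - 8) + rnorm(n, mean = 0, sd = 0.8)
quality_demo <- pmin(pmax(round(raw_grade), 3), 9)   # clip to the observed 3..9 range

# Median alcohol for each grade
median_by_grade <- tapply(alcohol_demo, quality_demo, median)
print(round(median_by_grade, 2))

# Pearson correlation, with quality kept numeric (not a factor)
r <- cor(alcohol_demo, quality_demo)
print(round(r, 2))

# Box plot per grade, drawn only in an interactive session
if (interactive()) {
  boxplot(alcohol_demo ~ factor(quality_demo),
          xlab = "quality", ylab = "alcohol")
}
```

Note that `cor()` needs a numeric quality column, so this check would have to run before quality is converted to a factor or on a numeric copy of it.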
This plot shows the interesting, strong correlation between alcohol and density. It is easy to see that wine quality increases as alcohol increases and density decreases. However, there are some good-quality outliers with low alcohol and high density. Perhaps other factors, such as residual sugar and fixed acidity, also influence wine quality even though they do not show a strong linear correlation with it.
So, to explore which other factors make a high-quality wine, let’s produce further plots, replacing density with each of them.
Those outliers are still there, with high residual sugar and low alcohol. So residual sugar by itself does not explain them.
Now let’s look at volatile acidity.
This plot shows an interesting slight tendency: at lower alcohol levels, wines with low volatile acidity tend to have better quality.
It is really difficult to find a pattern in the influence of total sulfur dioxide on wine quality. As the previous multivariate scatter plots showed, alcohol correlates with wine quality, but this plot does not clearly show how total sulfur dioxide acts on it.
This plot shows that wine quality correlates with alcohol content: quality tends to increase as alcohol increases. Moreover, according to the correlation matrix produced earlier in this report, alcohol is the attribute with the most substantial correlation with wine quality.
The correlation matrix also identified very strong correlations between density and both alcohol and residual sugar. This is expected, given that alcohol and residual sugar change the density of wine for chemical reasons. The previous plots confirm the correlation: density tends to increase as residual sugar increases and to decrease as alcohol increases.
Again we see the relevant correlation between alcohol and quality. However, this plot also shows a slight tendency for quality to increase as volatile acidity decreases at lower alcohol levels. This suggests that, besides alcohol, other attributes contribute to wine quality and in combination may characterize a high-quality wine.
The dataset provided by Cortez et al. (2009) contains attributes of 4,898 white wines. With this data collection it was possible to explore how these chemical attributes correlate with each other and how they relate to white wine quality. Each wine was reviewed by experts on a scale from 0 to 10. Initially, the exploration looked at the attribute distributions with the aid of univariate histograms. Most attributes were approximately normally distributed, with the exception of residual sugar. After that, the correlations between attributes were analyzed through the correlation matrix and bivariate scatter plots. Some chemical expectations were confirmed by very strong correlations, such as density depending on residual sugar and alcohol. Still in the bivariate analysis, it was surprising to discover that alcohol had the highest correlation with wine quality.
It was hard to detect patterns in the correlation between wine quality and the other attributes, but the multivariate scatter plots made it possible to notice that other attributes also act on wine quality. Volatile acidity contributes to wine quality, given that low volatile acidity tends to go with higher quality. However, the other multivariate plots failed to show strong patterns or correlations, and it was not possible to draw insights or conclusions from them. As for future work, this dataset could be enriched with more wine attributes and observations so that a quality prediction model could be built using machine learning techniques.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.